{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "rNdWfPXCjTjY"
   },
   "source": [
    "# Improving Data Quality\n",
    "\n",
    "**Learning Objectives**\n",
    "\n",
    "\n",
    "1. Resolve missing values\n",
    "2. Convert the Date feature column to a datetime format\n",
    "3. Rename a feature column, remove a value from a feature column\n",
    "4. Create one-hot encoding features\n",
    "5. Understand temporal feature conversions \n",
    "\n",
    "\n",
    "## Introduction \n",
    "\n",
    "Recall that machine learning models can only consume numeric data, and that numeric data should be \"1\"s or \"0\"s.  Data is said to be \"messy\" or \"untidy\" if it is missing attribute values, contains noise or outliers, has duplicates, wrong data, upper/lower case column names, and is essentially not ready for ingestion by a machine learning algorithm.  \n",
    "\n",
    "This notebook presents and solves some of the most common issues of \"untidy\" data.  Note that different problems will require different methods, and they are beyond the scope of this notebook.\n",
    "\n",
    "Each learning objective will correspond to a _#TODO_ in this student lab notebook -- try to complete this notebook first and then review the [solution notebook](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/launching_into_ml/solutions/improve_data_quality.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "VxyBFc_kKazA"
   },
   "source": [
    "Start by importing the necessary libraries for this lab."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Importing necessary tensorflow library and printing the TF version.\n",
    "import tensorflow as tf\n",
    "\n",
    "print(\"TensorFlow version: \",tf.version.VERSION)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd  # First, we'll import Pandas, a data processing and CSV file I/O library\n",
    "import numpy as np\n",
    "from datetime import datetime\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset is based on California's [Vehicle Fuel Type Count by Zip Code](https://data.ca.gov/dataset/vehicle-fuel-type-count-by-zip-codeSynthetic) report.  The dataset has been modified to make the data \"untidy\" and is thus a synthetic representation that can be used for learning purposes.  \n",
    "  \n",
    "Let's download the raw .csv data by copying the data from a cloud storage bucket.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.isdir(\"../data/transport\"):\n",
    "    os.makedirs(\"../data/transport\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Copying gs://cloud-training/mlongcp/v3.0_MLonGC/toy_data/untidy_vehicle_data_toy.csv...\n",
      "/ [1 files][ 23.7 KiB/ 23.7 KiB]                                                \n",
      "Operation completed over 1 objects/23.7 KiB.                                     \n"
     ]
    }
   ],
   "source": [
    "!gcloud storage cp gs://cloud-training/mlongcp/v3.0_MLonGC/toy_data/untidy_vehicle_data_toy.csv ../data/transport"   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 24\n",
      "-rw-r--r-- 1 jupyter jupyter 24263 Dec 23 15:55 untidy_vehicle_data_toy.csv\n"
     ]
    }
   ],
   "source": [
    "!ls -l ../data/transport"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read Dataset into a Pandas DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "lM6-n6xntv3t"
   },
   "source": [
    "Next, let's read in the dataset just copied from the cloud storage bucket and create a Pandas DataFrame.  We also add a Pandas .head() function to show you the top 5 rows of data in the DataFrame. Head() and Tail() are \"best-practice\" functions used to investigate datasets.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 222
    },
    "colab_type": "code",
    "id": "REZ57BXCLdfG",
    "outputId": "a6ef2eda-c7eb-4e2d-92e4-e7fcaa20b0af"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Zip Code</th>\n",
       "      <th>Model Year</th>\n",
       "      <th>Fuel</th>\n",
       "      <th>Make</th>\n",
       "      <th>Light_Duty</th>\n",
       "      <th>Vehicles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10/1/2018</td>\n",
       "      <td>90000</td>\n",
       "      <td>2006</td>\n",
       "      <td>Gasoline</td>\n",
       "      <td>OTHER/UNK</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10/1/2018</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2014</td>\n",
       "      <td>Gasoline</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Yes</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>90000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Gasoline</td>\n",
       "      <td>OTHER/UNK</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10/1/2018</td>\n",
       "      <td>90000</td>\n",
       "      <td>2017</td>\n",
       "      <td>Gasoline</td>\n",
       "      <td>OTHER/UNK</td>\n",
       "      <td>Yes</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10/1/2018</td>\n",
       "      <td>90000</td>\n",
       "      <td>&lt;2006</td>\n",
       "      <td>Diesel and Diesel Hybrid</td>\n",
       "      <td>OTHER/UNK</td>\n",
       "      <td>No</td>\n",
       "      <td>55.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Date Zip Code Model Year                      Fuel       Make  \\\n",
       "0  10/1/2018    90000       2006                  Gasoline  OTHER/UNK   \n",
       "1  10/1/2018      NaN       2014                  Gasoline        NaN   \n",
       "2        NaN    90000        NaN                  Gasoline  OTHER/UNK   \n",
       "3  10/1/2018    90000       2017                  Gasoline  OTHER/UNK   \n",
       "4  10/1/2018    90000      <2006  Diesel and Diesel Hybrid  OTHER/UNK   \n",
       "\n",
       "  Light_Duty  Vehicles  \n",
       "0        NaN       1.0  \n",
       "1        Yes       1.0  \n",
       "2        Yes       NaN  \n",
       "3        Yes       1.0  \n",
       "4         No      55.0  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_transport = pd.read_csv('../data/transport/untidy_vehicle_data_toy.csv')\n",
    "df_transport.head() # Output the first five rows."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DataFrame Column Data Types\n",
    "\n",
    "DataFrames may have heterogenous or \"mixed\" data types, that is, some columns are numbers, some are strings, and some are dates etc. Because CSV files do not contain information on what data types are contained in each column, Pandas infers the data types when loading the data, e.g. if a column contains only numbers, Pandas will set that column’s data type to numeric: integer or float.\n",
    "\n",
    "Run the next cell to see information on the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 499 entries, 0 to 498\n",
      "Data columns (total 7 columns):\n",
      "Date          497 non-null object\n",
      "Zip Code      497 non-null object\n",
      "Model Year    497 non-null object\n",
      "Fuel          497 non-null object\n",
      "Make          496 non-null object\n",
      "Light_Duty    496 non-null object\n",
      "Vehicles      496 non-null float64\n",
      "dtypes: float64(2), object(5)\n",
      "memory usage: 27.4+ KB\n"
     ]
    }
   ],
   "source": [
    "df_transport.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From what the .info() function shows us, we have six string objects and one float object.  Let's print out the first and last five rows of each column. We can definitely see more of the \"string\" object values now!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "          Date Zip Code Model Year                      Fuel       Make  \\\n",
      "0    10/1/2018    90000       2006                  Gasoline  OTHER/UNK   \n",
      "1    10/1/2018      NaN       2014                  Gasoline        NaN   \n",
      "2          NaN    90000        NaN                  Gasoline  OTHER/UNK   \n",
      "3    10/1/2018    90000       2017                  Gasoline  OTHER/UNK   \n",
      "4    10/1/2018    90000      <2006  Diesel and Diesel Hybrid  OTHER/UNK   \n",
      "..         ...      ...        ...                       ...        ...   \n",
      "494   6/7/2019    90003       2012                  Gasoline     Type_R   \n",
      "495   6/8/2019    90003       2012           Hybrid Gasoline  OTHER/UNK   \n",
      "496   6/9/2019    90003       2012           Hybrid Gasoline     Type_Q   \n",
      "497  6/10/2019    90003       2012               Natural Gas  OTHER/UNK   \n",
      "498  6/11/2019    90003       2012            Plug-in Hybrid  OTHER/UNK   \n",
      "\n",
      "    Light_Duty  Vehicles  \n",
      "0          NaN       1.0  \n",
      "1          Yes       1.0  \n",
      "2          Yes       NaN  \n",
      "3          Yes       1.0  \n",
      "4           No      55.0  \n",
      "..         ...       ...  \n",
      "494        Yes      26.0  \n",
      "495        Yes       4.0  \n",
      "496        Yes      25.0  \n",
      "497        Yes       1.0  \n",
      "498        Yes       3.0  \n",
      "\n",
      "[499 rows x 7 columns] 5\n"
     ]
    }
   ],
   "source": [
    "print(df_transport)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary Statistics \n",
    "\n",
    "At this point, we have only one column which contains a numerical value (e.g. Vehicles).  For features which contain numerical values, we are often interested in various statistical measures relating to those values.  We can use .describe() to see some summary statistics for the numeric fields in our dataframe. Note, that because we only have one numeric feature, we see only one summary stastic - for now.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Zip Code</th>\n",
        "      <th>Vehicles</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>count</th>\n",
        "      <td>497.00000</td>\n",
        "      <td>496.000000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>mean</th>\n",
        "      <td>89838.23340</td>\n",
        "      <td>74.512097</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>std</th>\n",
        "      <td>3633.35609</td>\n",
        "      <td>243.839871</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>min</th>\n",
        "      <td>9001.00000</td>\n",
        "      <td>1.000000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>25%</th>\n",
        "      <td>90001.00000</td>\n",
        "      <td>14.000000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>50%</th>\n",
        "      <td>90001.00000</td>\n",
        "      <td>25.000000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>75%</th>\n",
        "      <td>90001.00000</td>\n",
        "      <td>56.250000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>max</th>\n",
        "      <td>90002.00000</td>\n",
        "      <td>3178.000000</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
        "          Zip Code     Vehicles\n",
        "count    497.00000   496.000000\n",
        "mean   89838.23340    74.512097\n",
        "std     3633.35609   243.839871\n",
        "min     9001.00000     1.000000\n",
        "25%    90001.00000    14.000000\n",
        "50%    90001.00000    25.000000\n",
        "75%    90001.00000    56.250000\n",
        "max    90002.00000  3178.000000"
       ]
      },
      "execution_count": 10,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "df_transport.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's investigate a bit more of our data by using the .groupby() function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Date</th>\n",
        "      <th>Zip Code</th>\n",
        "      <th>Model Year</th>\n",
        "      <th>Make</th>\n",
        "      <th>Light_Duty</th>\n",
        "      <th>Vehicles</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Fuel</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>Battery Electric</th>\n",
        "      <td>10/1/2018</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>&lt;2006</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>No</td>\n",
        "      <td>4.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Diesel and Diesel Hybrid</th>\n",
        "      <td>10/1/2018</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>&lt;2006</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>No</td>\n",
        "      <td>55.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Flex-Fuel</th>\n",
        "      <td>10/14/2018</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2007</td>\n",
        "      <td>Type_A</td>\n",
        "      <td>Yes</td>\n",
        "      <td>78.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Gasoline</th>\n",
        "      <td>10/1/2018</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>Yes</td>\n",
        "      <td>1.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Hybrid Gasoline</th>\n",
        "      <td>10/24/2018</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2009</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>Yes</td>\n",
        "      <td>18.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Natural Gas</th>\n",
        "      <td>10/25/2018</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2009</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>No</td>\n",
        "      <td>2.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Other</th>\n",
        "      <td>10/8/2018</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>&lt;2006</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>Yes</td>\n",
        "      <td>6.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>Plug-in Hybrid</th>\n",
        "      <td>11/2/2018</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2012</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>Yes</td>\n",
        "      <td>1.0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
        "                                Date  Zip Code Model Year       Make  \\\n",
        "Fuel                                                                   \n",
        "Battery Electric           10/1/2018   90000.0      <2006  OTHER/UNK   \n",
        "Diesel and Diesel Hybrid   10/1/2018   90000.0      <2006  OTHER/UNK   \n",
        "Flex-Fuel                 10/14/2018   90001.0       2007     Type_A   \n",
        "Gasoline                   10/1/2018   90000.0       2006  OTHER/UNK   \n",
        "Hybrid Gasoline           10/24/2018   90001.0       2009  OTHER/UNK   \n",
        "Natural Gas               10/25/2018   90001.0       2009  OTHER/UNK   \n",
        "Other                      10/8/2018   90000.0      <2006  OTHER/UNK   \n",
        "Plug-in Hybrid             11/2/2018   90001.0       2012  OTHER/UNK   \n",
        "\n",
        "                         Light_Duty  Vehicles  \n",
        "Fuel                                           \n",
        "Battery Electric                 No       4.0  \n",
        "Diesel and Diesel Hybrid         No      55.0  \n",
        "Flex-Fuel                       Yes      78.0  \n",
        "Gasoline                        Yes       1.0  \n",
        "Hybrid Gasoline                 Yes      18.0  \n",
        "Natural Gas                      No       2.0  \n",
        "Other                           Yes       6.0  \n",
        "Plug-in Hybrid                  Yes       1.0  "
       ]
      },
      "execution_count": 11,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "df_transport.groupby('Fuel').first() # Get the first entry for each month. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Checking for Missing Values\n",
    "\n",
    "Missing values adversely impact data quality, as they can lead the machine learning model to make inaccurate inferences about the data. Missing values can be the result of numerous factors, e.g. \"bits\" lost during streaming transmission, data entry, or perhaps a user forgot to fill in a field.  Note that Pandas recognizes both empty cells and “NaN” types as missing values. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's show the null values for all features in the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/plain": [
        "Date          2\n",
        "Zip Code      2\n",
        "Model Year    2\n",
        "Fuel          2\n",
        "Make          3\n",
        "Light_Duty    3\n",
        "Vehicles      3\n",
        "dtype: int64"
       ]
      },
      "execution_count": 12,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "df_transport.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see a sampling of which values are missing, enter the feature column name.  You'll notice that \"False\" and \"True\" correpond to the presence or abscence of a value by index number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "0      10/1/2018\n",
       "1      10/1/2018\n",
       "2            NaN\n",
       "3      10/1/2018\n",
       "4      10/1/2018\n",
       "         ...    \n",
       "494    12/3/2018\n",
       "495    12/4/2018\n",
       "496    12/5/2018\n",
       "497    12/6/2018\n",
       "498    12/7/2018\n",
       "Name: Date, Length: 499, dtype: object\n",
       "0      False\n",
       "1      False\n",
       "2       True\n",
       "3      False\n",
       "4      False\n",
       "       ...  \n",
       "494    False\n",
       "495    False\n",
       "496    False\n",
       "497    False\n",
       "498    False\n",
       "Name: Date, Length: 499, dtype: bool\n"
      ]
     }
   ],
   "source": [
    "print (df_transport['Date'])\n",
    "print (df_transport['Date'].isnull())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "0      OTHER/UNK\n",
       "1            NaN\n",
       "2      OTHER/UNK\n",
       "3      OTHER/UNK\n",
       "4      OTHER/UNK\n",
       "         ...    \n",
       "494       Type_I\n",
       "495       Type_B\n",
       "496       Type_C\n",
       "497       Type_J\n",
       "498       Type_J\n",
       "Name: Make, Length: 499, dtype: object\n",
       "0      False\n",
       "1       True\n",
       "2      False\n",
       "3      False\n",
       "4      False\n",
       "       ...  \n",
       "494    False\n",
       "495    False\n",
       "496    False\n",
       "497    False\n",
       "498    False\n",
       "Name: Make, Length: 499, dtype: bool\n"
      ]
     }
   ],
   "source": [
    "print (df_transport['Make'])\n",
    "print (df_transport['Make'].isnull())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "0       2006\n",
       "1       2014\n",
       "2        NaN\n",
       "3       2017\n",
       "4      <2006\n",
       "       ...  \n",
       "494     2010\n",
       "495     2010\n",
       "496     2010\n",
       "497     2010\n",
       "498     2010\n",
       "Name: Model Year, Length: 499, dtype: object\n",
       "0      False\n",
       "1      False\n",
       "2       True\n",
       "3      False\n",
       "4      False\n",
       "       ...  \n",
       "494    False\n",
       "495    False\n",
       "496    False\n",
       "497    False\n",
       "498    False\n",
       "Name: Model Year, Length: 499, dtype: bool\n"
      ]
     }
   ],
   "source": [
    "print (df_transport['Model Year'])\n",
    "print (df_transport['Model Year'].isnull())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What can we deduce about the data at this point?\n",
    "\n",
    "First, let's summarize our data by row, column, features, unique, and missing values,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "Rows     :  499\n",
       "Columns  :  7\n",
       "\n",
       "Features : \n",
       " ['Date', 'Zip Code', 'Model Year', 'Fuel', 'Make', 'Light_Duty', 'Vehicles']\n",
       "\n",
       "Unique values :  \n",
       " Date          130\n",
       "Zip Code        4\n",
       "Model Year     15\n",
       "Fuel            8\n",
       "Make           43\n",
       "Light_Duty      2\n",
       "Vehicles      151\n",
       "dtype: int64\n",
       "\n",
       "Missing values :   17\n"
      ]
     }
   ],
   "source": [
    "print (\"Rows     : \" ,df_transport.shape[0])\n",
    "print (\"Columns  : \" ,df_transport.shape[1])\n",
    "print (\"\\nFeatures : \\n\" ,df_transport.columns.tolist())\n",
    "print (\"\\nUnique values :  \\n\",df_transport.nunique())\n",
    "print (\"\\nMissing values :  \", df_transport.isnull().sum().values.sum())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see the data again -- this time the last five rows in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Date</th>\n",
        "      <th>Zip Code</th>\n",
        "      <th>Model Year</th>\n",
        "      <th>Fuel</th>\n",
        "      <th>Make</th>\n",
        "      <th>Light_Duty</th>\n",
        "      <th>Vehicles</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>494</th>\n",
        "      <td>12/3/2018</td>\n",
        "      <td>90002.0</td>\n",
        "      <td>2010</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_I</td>\n",
        "      <td>Yes</td>\n",
        "      <td>11.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>495</th>\n",
        "      <td>12/4/2018</td>\n",
        "      <td>90002.0</td>\n",
        "      <td>2010</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_B</td>\n",
        "      <td>Yes</td>\n",
        "      <td>58.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>496</th>\n",
        "      <td>12/5/2018</td>\n",
        "      <td>90002.0</td>\n",
        "      <td>2010</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_C</td>\n",
        "      <td>Yes</td>\n",
        "      <td>45.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>497</th>\n",
        "      <td>12/6/2018</td>\n",
        "      <td>90002.0</td>\n",
        "      <td>2010</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_J</td>\n",
        "      <td>Yes</td>\n",
        "      <td>82.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>498</th>\n",
        "      <td>12/7/2018</td>\n",
        "      <td>90002.0</td>\n",
        "      <td>2010</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_J</td>\n",
        "      <td>Yes</td>\n",
        "      <td>12.0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
        "          Date  Zip Code Model Year      Fuel    Make Light_Duty  Vehicles\n",
        "494  12/3/2018   90002.0       2010  Gasoline  Type_I        Yes      11.0\n",
        "495  12/4/2018   90002.0       2010  Gasoline  Type_B        Yes      58.0\n",
        "496  12/5/2018   90002.0       2010  Gasoline  Type_C        Yes      45.0\n",
        "497  12/6/2018   90002.0       2010  Gasoline  Type_J        Yes      82.0\n",
        "498  12/7/2018   90002.0       2010  Gasoline  Type_J        Yes      12.0"
       ]
      },
      "execution_count": 17,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "df_transport.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What Are Our Data Quality Issues?\n",
    "\n",
    "1. **Data Quality Issue #1**:  \n",
    "> **Missing Values**:\n",
    "Each feature column has multiple missing values.  In fact, we have a total of 18 missing values.\n",
    "2. **Data Quality Issue #2**: \n",
    "> **Date DataType**:  Date is shown as an \"object\" datatype and should be a datetime.  In addition, Date is in one column.  Our business requirement is to see the Date parsed out to year, month, and day.  \n",
    "3. **Data Quality Issue #3**: \n",
    "> **Model Year**: We are only interested in years greater than 2006, not \"<2006\".\n",
    "4. **Data Quality Issue #4**:  \n",
    "> **Categorical Columns**:  The feature column \"Light_Duty\" is categorical and has a \"Yes/No\" choice.  We cannot feed values like this into a machine learning model.  In addition, we need to \"one-hot encode the remaining \"string\"/\"object\" columns.\n",
    "5. **Data Quality Issue #5**:  \n",
    "> **Temporal Features**:  How do we handle year, month, and day?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Quality Issue #1:  \n",
    "##### Resolving Missing Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most algorithms do not accept missing values.  Yet, when we see missing values in our dataset, there is always a tendency to just \"drop all the rows\" with missing values.  Although Pandas will fill in the blank space with “NaN\", we should \"handle\" them in some way.\n",
    "\n",
    "While all the methods to handle missing values is beyond the scope of this lab, there are a few methods you should consider.  For numeric columns, use the \"mean\" values to fill in the missing numeric values.  For categorical columns, use the \"mode\" (or most frequent values) to fill in missing categorical values. \n",
    "\n",
    "In this lab, we use the .apply and Lambda functions to fill every column with its own most frequent value.  You'll learn more about Lambda functions later in the lab."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check again for missing values by showing how many rows contain NaN values for each feature column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #1a:** Check for missing values by showing how many rows contain NaN values for each feature column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/plain": [
        "Date          2\n",
        "Zip Code      2\n",
        "Model Year    2\n",
        "Fuel          2\n",
        "Make          3\n",
        "Light_Duty    3\n",
        "Vehicles      3\n",
        "dtype: int64"
       ]
      },
      "execution_count": 18,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# TODO 1a\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #1b:** Apply the lambda function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO 1b\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #1c:** Check again for missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/plain": [
        "Date          0\n",
        "Zip Code      0\n",
        "Model Year    0\n",
        "Fuel          0\n",
        "Make          0\n",
        "Light_Duty    0\n",
        "Vehicles      0\n",
        "dtype: int64"
       ]
      },
      "execution_count": 20,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# TODO 1c\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Quality Issue #2:  \n",
    "##### Convert the Date Feature Column to a Datetime Format"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The date column is indeed shown as a string object."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #2a:** Convert the datetime datatype with the to_datetime() function in Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO 2a\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #2b:** Show the converted Date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "<class 'pandas.core.frame.DataFrame'>\n",
       "RangeIndex: 499 entries, 0 to 498\n",
       "Data columns (total 7 columns):\n",
       " #   Column      Non-Null Count  Dtype         \n",
       "---  ------      --------------  -----         \n",
       " 0   Date        499 non-null    datetime64[ns]\n",
       " 1   Zip Code    499 non-null    float64       \n",
       " 2   Model Year  499 non-null    object        \n",
       " 3   Fuel        499 non-null    object        \n",
       " 4   Make        499 non-null    object        \n",
       " 5   Light_Duty  499 non-null    object        \n",
       " 6   Vehicles    499 non-null    float64       \n",
       "dtypes: datetime64[ns](1), float64(2), object(4)\n",
       "memory usage: 27.4+ KB\n"
      ]
     }
   ],
   "source": [
    "# TODO 2b\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's parse Date into three columns, e.g. year, month, and day."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "<class 'pandas.core.frame.DataFrame'>\n",
       "RangeIndex: 499 entries, 0 to 498\n",
       "Data columns (total 10 columns):\n",
       " #   Column      Non-Null Count  Dtype         \n",
       "---  ------      --------------  -----         \n",
       " 0   Date        499 non-null    datetime64[ns]\n",
       " 1   Zip Code    499 non-null    float64       \n",
       " 2   Model Year  499 non-null    object        \n",
       " 3   Fuel        499 non-null    object        \n",
       " 4   Make        499 non-null    object        \n",
       " 5   Light_Duty  499 non-null    object        \n",
       " 6   Vehicles    499 non-null    float64       \n",
       " 7   year        499 non-null    int64         \n",
       " 8   month       499 non-null    int64         \n",
       " 9   day         499 non-null    int64         \n",
       "dtypes: datetime64[ns](1), float64(2), int64(3), object(4)\n",
       "memory usage: 39.1+ KB\n"
      ]
     }
   ],
   "source": [
    "df_transport['year'] = df_transport['Date'].dt.year\n",
    "df_transport['month'] = df_transport['Date'].dt.month\n",
    "df_transport['day'] = df_transport['Date'].dt.day\n",
    "#df['hour'] = df['date'].dt.hour - you could use this if your date format included hour.\n",
    "#df['minute'] = df['date'].dt.minute - you could use this if your date format included minute.\n",
    "df_transport.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's confirm the Date parsing.  This will also give us a another visualization of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Date</th>\n",
        "      <th>Zip Code</th>\n",
        "      <th>Model Year</th>\n",
        "      <th>Fuel</th>\n",
        "      <th>Make</th>\n",
        "      <th>Light_Duty</th>\n",
        "      <th>Vehicles</th>\n",
        "      <th>year</th>\n",
        "      <th>day</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>month</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>2019-01-01</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2016</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_G</td>\n",
        "      <td>Yes</td>\n",
        "      <td>18.0</td>\n",
        "      <td>2019</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>2019-02-01</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2017</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_D</td>\n",
        "      <td>Yes</td>\n",
        "      <td>13.0</td>\n",
        "      <td>2019</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>2019-03-01</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_C</td>\n",
        "      <td>Yes</td>\n",
        "      <td>32.0</td>\n",
        "      <td>2019</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>Yes</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>11</th>\n",
        "      <td>2018-11-01</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2007</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_M</td>\n",
        "      <td>Yes</td>\n",
        "      <td>15.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>12</th>\n",
        "      <td>2018-12-02</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2015</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>Type_G</td>\n",
        "      <td>Yes</td>\n",
        "      <td>19.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>2</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
        "            Date  Zip Code Model Year      Fuel       Make Light_Duty  \\\n",
        "month                                                                   \n",
        "1     2019-01-01   90001.0       2016  Gasoline     Type_G        Yes   \n",
        "2     2019-02-01   90001.0       2017  Gasoline     Type_D        Yes   \n",
        "3     2019-03-01   90001.0       2018  Gasoline     Type_C        Yes   \n",
        "10    2018-10-01   90000.0       2006  Gasoline  OTHER/UNK        Yes   \n",
        "11    2018-11-01   90001.0       2007  Gasoline     Type_M        Yes   \n",
        "12    2018-12-02   90001.0       2015  Gasoline     Type_G        Yes   \n",
        "\n",
        "       Vehicles  year  day  \n",
        "month                       \n",
        "1          18.0  2019    1  \n",
        "2          13.0  2019    1  \n",
        "3          32.0  2019    1  \n",
        "10          1.0  2018    1  \n",
        "11         15.0  2018    1  \n",
        "12         19.0  2018    2  "
       ]
      },
      "execution_count": 24,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# Here, we are creating a new dataframe called \"grouped_data\" and grouping by on the column \"Make\"\n",
    "grouped_data = df_transport.groupby(['Make'])\n",
    "\n",
    "# Get the first entry for each month.\n",
    "df_transport.groupby('month').first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have Dates as a integers, let's do some additional plotting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<seaborn.axisgrid.JointGrid at 0x7f559d06e668>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 720x432 with 0 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAbIAAAGoCAYAAAAjPmDhAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3X+U3XV95/Hnm2TU8cc6oCklQygspvFgrcSdA3TxdK0UE6mV6Fm7gNuy1lO6p1i19aQl2lOwXVZ6YrU/1rKLNYoVpIgxRqXGFOx2+wMkJUAINJKCQCYIQRy0MoUwvPeP+51wE2Ymd2bur893no9z7pl73/d7731/w3Bf8/3ez/18IjORJKlUR/S6AUmS5sMgkyQVzSCTJBXNIJMkFc0gkyQVzSCTJBXNIJMkFc0gkyQVzSCTJBVtca8b6BCnK5FUB9HrBkrgEZkkqWh1PSKbs6tvfqDXLUzrvFOP63ULktR3PCKTJBXNIJMkFc0gkyQVzSCTJBXNIJMkFc0gkyQVzSCTJBXNIJMkFc0gkyQVzSCTJBXNIJMkFc0gkyQVzSCTJBXNIJMkFc0gkyQVzSCTJBXNIJMkFc0gkyQVzSCTJBXNIJMkFc0gkyQVzSCTJBXNIJMkFW1xrxuQ1P+uvvmBXrdQtPNOPa7XLdSaR2SSpKIZZJKkohlkkqSiGWSSpKI52KMg/f6Bux9oS+oFj8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUVz9nu1jbPzS+oFj8gkSUXziEwLhkeMUj15RCZJKppBJkkqmkEmSSqaQSZJKppBJkkqmkEmSSpaZGave2i7iPga8PJe9zGDlwOP9rqJNqnLvtRlP6A++1KX/YC578ujmbm63c3UTS2DrN9FxLbMHOl1H+1Ql32py35AffalLvsB9dqXfuSpRUlS0QwySVLRDLLeuKLXDbRRXfalLvsB9dmXuuwH1Gtf+o6fkUmSiuYRmSSpaAaZJKloBpkkqWgGmSSpaLUMstWrVyfgxYsXL6VfWlbD972W1TLIHn20LrPaSFJrFvL7Xi2DTJK0cBhkkqSiGWSSpKIZZJKkohlkkqSiGWSSpKIZZJKkohlkkqSiGWSSpKIt7nUDC8mm7aOs37KLvWPjLB0aZO2qFaxZOdzrtiSpaAZZl2zaPsq6jTsY3z8BwOjYOOs27gAwzCRpHjy12CXrt+w6EGKTxvdPsH7Lrh51JEn1YJB1yd6x8VnVJUmtMci6ZOnQ4KzqkqTWGGRdsnbVCgYHFh1UGxxYxNpVK3rUkSTVg4M9umRyQIejFiWpvQyyLlqzctjgkqQ289SiJKloBpkkqWgGmSSpaAaZJKloBpkkqWgGmSSpaAaZJKloBpkkqWgdC7KIeEFEfDMibo+InRHxoap+QkTcHBG7I+IvI+J5Vf351e3d1f3HNz3Xuqq+KyJWdapnSVJ5OnlE9iTwhsx8DXAysDoiTgP+APhYZr4C+B7wrmr7dwHfq+ofq7YjIk4CzgFeBawG/iwiDp60UJK0YHUsyLLhX6ubA9UlgTcA11X1K4E11fWzq9tU958REVHVr8nMJzPzPmA3cEqn+pYklaWjn5FFxKKIuA14BNgK/AswlplPV5vsASYnHxwGHgSo7n8ceFlzfYrHNL/WBRGxLSK27du3rxO7I0l9xfe9ho4GWWZOZObJwLE0jqJe2cHXuiIzRzJzZMmSJZ16GUnqG77vNXRl9vvMHIuIbwA/BQxFxOLqqOtYYLTabBRYBuyJiMXAS4HvNtUnNT9Gh9i0fdSlYiQtKJ0ctbgkIoaq64PAmcDdwDeA/1xtdj7wper65uo21f03ZmZW9XOqUY0nAMuBb3aq75Jt2j7Kuo07GB0bJ4HRsXHWbdzBpu3mvqT66uSpxWOAb0TEHcAtwNbM/Arw28BvRsRuGp+BfbLa/pPAy6r6bwIXAWTmTuBa4C7ga8CFmTnRwb6LtX7LLsb3H/xPM75/gvVbdvWoI0nqvI6dWszMO4CVU9TvZYpRh5n5b8Dbp3muS4FL291j3ewdG59VXZLqwJk9amTp0OCs6pJUBwZZjaxdtYLBgYO/Kz44sIi1q1b0qCNJ6ryujFpUd0yOTnTUoqSFxCCrmTUrhw0uSQuKpxYlSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRTPIJElFM8gkSUUzyCRJRVvc6wb6xabto6zfsou9Y+MsHRpk7aoVrFk53Ou2JEmHYZDRCLF1G3cwvn8CgNGxcdZt3AFgmElSn/PUIrB+y64DITZpfP8E67fs6lFHkqRWdSzIImJZRHwjIu6KiJ0R8d6qfklEjEbEbdXlrKbHrIuI3RGxKyJWNdVXV7XdEXFRu3vdOzY+q7okqX908tTi08D7M/PWiHgJ8E8RsbW672OZ+ZHmjSPiJOAc4FXAUuCvI+LHq7s/DpwJ7AFuiYjNmXlXuxpdOjTI6BShtXRosF0vIUnqkI4dkWXmQ5l5a3X9B8DdwEwfOJ0NXJOZT2bmfcBu4JTqsjsz783Mp4Brqm3bZu2qFQwOLDqoNjiwiLWrVrTzZSRJHdCVz8gi4nhgJXBzVXp3RNwRERsi4siqNgw82PSwPVVtuvqhr3FBRGyLiG379u2bVX9rVg7z4be9muGhQQIYHhrkw297tQM9JPW1+bzv1UnHRy1GxIuBLwDvy8zvR8TlwO8DWf38Q+CX5/s6mXkFcAXAyMhIzvbxa1YOG1ySijLf97266GiQRcQAjRC7KjM3AmTmw033fwL4SnVzFFjW9PBjqxoz1CVJC1wnRy0G8Eng7sz8aFP9mKbN3grcWV3fDJwTEc+PiBOA5cA3gVuA5RFxQkQ8j8aAkM2d6luSVJZOHpGdDvwisCMibqtqHwDOjYiTaZxa/DbwqwCZuTMirgXuojHi8cLMnACIiHcDW4BFwIbM3NnBviVJBelYkGXm3wExxV3Xz/CYS4FLp6hfP9PjJEkLlzN7SJKKZpBJkopmkEmSimaQSZKKZpBJkopmkEmSiubCmhVXiJakMhlkuEK0JJXMU4u4QrQklcwgwxWiJalkBhnTrwTtCtGS1P8MMlwhWpJK5mAPnh3Q4ahFSSqPQVZxhWhJKpOnFiVJRTPIJElFM8gkSUUzyCRJRTPIJElFc9RixUmDJalMBhlOGixJJfPUIk4aLEklM8hw0mBJKplBhpMGS1LJDDKcNFiSSuZgD5w0WJJKZpBVnDRYksrkqUVJUtEMMklS0QwySVLRDDJJUtEMMklS0Ry1WHHSYEkqk0GGkwZLUsk8tYiTBktSyToWZBGxLCK+ERF3RcTOiHhvVT8qIrZGxD3VzyOrekTEn0TE7oi4IyJe2/Rc51fb3xMR57e7VycNlqRydfKI7Gng/Zl5EnAacGFEnARcBNyQmcuBG6rbAG8ClleXC4DLoRF8wMXAqcApwMWT4dcuThosSeXqWJBl5kOZeWt1/QfA3cAwcDZwZbXZlcCa6vrZwGey4SZgKCKOAVYBWzPzscz8HrAVWN3OXp00WJLK1ZXBHhFxPLASuBk4OjMfqu76DnB0dX0YeLDpYXuq2nT1Q1/jAhpHchx33HGz6s9JgyWVaD7ve3XS8SCLiBcDXwDel5nfj4gD92VmRkS243Uy8wrgCoCRkZFZP6eTBksqzXzf9+qio6MWI2KARohdlZkbq/LD1SlDqp+PVPVRYFnTw4+tatPVJUnq6KjFAD4J3J2ZH226azMwOfLwfOBLTfVfqkYvngY8Xp2C3AK8MSKOrAZ5vLGqSZLU0VOLpwO/COyIiNuq2geAy4BrI+JdwP3AL1T3XQ+cBewGngDeCZCZj0XE7wO3VNv9XmY+1sG+JUkF6ViQZebfATHN3WdMsX0CF07zXBuADe3rTpJUF87sIUkqmkEmSSqaQSZJKppBJkkqmsu4VFyPTJLKZJDhemSSVDJPLeJ6ZJJUMoMM1yOTpJIZZLgemSSVzCDD9cgkqWQO9qB765E5MlKS2s8gq3R6PTJHRkpSZ3hqsUscGSlJnXHYIIuIEyPi+dX110fEeyJiqPOt1YsjIyWpM1o5IvsCMBERr6CxpPYy4OqOdlVDjoyUpM5oJcieycyngbcCf5qZa4FjOttW/TgyUpI6o5XBHvsj4lzgfODnq9pA51qqp26NjJSkhaaVIHsn8N+BSzPzvog4AfiLzrZVT50eGSlJC9Fhgywz74qI3waOq27fB/xBpxuTJKkVrYxa/HngNuBr1e2TI2JzpxuTJKkVrQz2uAQ4BRgDyMzbgH/fwZ4kSWpZK0G2PzMfP6T2TCeakSRptloZ7LEzIs4DFkXEcuA9wD90ti1JklrTyhHZrwOvAp4EPgd8H3hfJ5uSJKlVrYxafAL4YHWRJKmvTBtkEfFlIKe7PzPf0pGOJEmahZmOyD7StS4kSZqjaYMsM/8vQES8CBjPzGeq24uA53enPUmSZtbKYI8bgBc23R4E/roz7UiSNDutBNkLMvNfJ29U1184w/aSJHVNK0H2w4h47eSNiPgPgKtBSpL6QitfiH4f8PmI2AsE8KPAf+loV5IktaiV75HdEhGvBCZXgNyVmfs725YkSa2Z6Xtkb8jMGyPibYfc9eMRQWZu7HBvkiQd1kxHZP8JuJFnV4VuloBBJknquZm+R3Zx9fOd3Wun3jZtH2X9ll3sHRtn6dAga1etcMVoSZqnVhbWfH5EnBcRH4iI3528tPC4DRHxSETc2VS7JCJGI+K26nJW033rImJ3ROyKiFVN9dVVbXdEXDSXnewHm7aPsm7jDkbHxklgdGycdRt3sGn7aK9bk6SitTL8/kvA2cDTwA+bLofzaWD1FPWPZebJ1eV6gIg4CTiHxiz7q4E/i4hF1SwiHwfeBJwEnFttW5z1W3Yxvn/ioNr4/gnWb9nVo44kqR5aGX5/bGZOFUgzysy/jYjjW9z8bOCazHwSuC8idtNYlRpgd2beCxAR11Tb3jXbfnpt79jUX72bri5Jak0rR2T/EBGvbuNrvjsi7qhOPR5Z1YaBB5u22VPVpqs/R0RcEBHbImLbvn372thueywdGpxVXZIOp9/f97pl2iCLiB0RcQfwOuDW6nOqO5rqc3E5cCJwMvAQ8IdzfJ7nyMwrMnMkM0eWLFnSrqdtm7WrVjA4sOig2uDAItauWjHNIyRpZv3+vtctM51afHO7XywzH568HhGfAL5S3RwFljVtemxVY4Z6USZHJzpqUZLaa6bh9/dPXo+I1wHLM/NTEbEEePFcXiwijsnMh6qbbwUmRzRuBq6OiI8CS4HlwDdpTIm1PCJOoBFg5wDnzeW1+8GalcMGlyS12WEHe0TExcAIjSmqPgUMAJ8FTj/M4z4HvB54eUTsAS4GXh8RJ9P4QvW3gV8FyMydEXEtjUEcTwMXZuZE9TzvBrYAi4ANmblz1nspSaqtVkYtvhVYCdwKkJl7I+Ilh3tQZp47RfmTM2x/KXDpFPXrgetb6FOStAC1MmrxqcxMGkdRkytGS5LUF1oJsmsj4v8AQxHxKzRWh/5EZ9uSJKk1M81+/3Hg6sz8SEScCXyfxudkv5uZW7vVoCRJM5npM7JvAR+JiGOAa2mE2vbutCVJUmumPbWYmX+cmT9FYzmX7wIbIuKfI+LiiPjxrnUoSdIMDvsZWWben5l/kJkrgXOBNcDdHe9MkqQWtLKMy+KI+PmIuAr4K2AXcOiq0ZIk9cRMgz3OpHEEdhaNWTauAS7IzFaWcJEkqStmGuyxDrgaeH9mfq9L/UiSNCszzbX4hm42IknSXLTyhWhJkvqWQSZJKppBJkkqmkEmSSpaK8u4qE02bR91hWhJajODrEs2bR9l3cYdjO+fAGB0bJx1G3cAGGaSNA+eWuyS9Vt2HQixSeP7J1i/ZVePOpKkejDIumTv2Pis6pKk1hhkXbJ0aHBWdUlSawyyLlm7agWDA4sOqg0OLGLtqhU96kiS6sHBHl0yOaDDUYuS1F4GWRetWTlscElSm3lqUZJUNINMklQ0g0ySVDSDTJJUNINMklQ0g0ySVDSDTJJUNINMklQ0g0ySVDSDTJJUNINMklQ0g0ySVDSDTJJUtI4FWURsiIhHIuLOptpREbE1Iu6pfh5Z1SMi/iQidkfEHRHx2qbHnF9tf09EnN+pfiVJZerkEdmngdWH1C4CbsjM5cAN1W2ANwHLq8sFwOXQCD7gYuBU4BTg4snwkyQJOhhkmfm3wGOHlM8GrqyuXwmsaap/JhtuAoYi4hhgFbA1Mx/LzO8BW3luOEqSFrBuf0Z2dGY+VF3/DnB0dX0YeLBpuz1Vbbq6JElADwd7ZGYC2a7ni4gLImJbRGzbt29fu55WkvqW73sN3Q6yh6tThlQ/H6nqo8Cypu2OrWrT1Z8jM6/IzJHMHFmyZEnbG5ekfuP7XkO3g2wzMDny8HzgS031X6pGL54GPF6dgtwCvDEijqwGebyxqkmSBMDiTj1xRHwOeD3w8ojYQ2P04WXAtRHxLuB+4Beqza8HzgJ2A08A7wTIzMci4veBW6rtfi8zDx1AIklawDoWZJl57jR3nTHFtglcOM3zbAA2tLE1SVKNOLOHJKloBpkkqWgGmSSpaAaZJKloBpkkqWgGmSSpaAaZJKloBpkkqWgGmSSpaAaZJKloBpkkqWgGmSSpaAaZJKloBpkkqWgGmSSpaAaZJKloBpkkqWgGmSSpaIt73UC/2LR9lPVbdrF3bJylQ4OsXbWCNSuHe92WJOkwDDIaIbZu4w7G908AMDo2zrqNOwAMM0nqc55aBNZv2XUgxCaN759g/ZZdPepIktQqgwzYOzY+q7okqX8YZMDSocFZ1SVJ/cMgA9auWsHgwKKDaoMDi1i7akWPOpIktcrBHjw7oMNRi5JUHoOssmblsMElSQXy1KIkqWgGmSSpaAaZJKloBpkkqWgO9ugi53OUpPYzyCqdDhnnc5SkzvDUIs+GzOjYOMmzIbNp+2jbXsP5HCWpMwwyuhMyzucoSZ1hkNGdkHE+R0nqDIOM7oSM8zlKUmf0JMgi4tsRsSMibouIbVXtqIjYGhH3VD+PrOoREX8SEbsj4o6IeG27+1m7agUDR8RBtYEjoq0hs2blMB9+26sZHhokgOGhQT78tlc70EOS5qmXoxZ/JjMfbbp9EXBDZl4WERdVt38beBOwvLqcClxe/WyvOMztNnA+R0lqv346tXg2cGV1/UpgTVP9M9lwEzAUEce084XXb9nF/ok8qLZ/Ih1RKEkF6FWQJfD1iPiniLigqh2dmQ9V178DHF1dHwYebHrsnqp2kIi4ICK2RcS2ffv2zaoZRxRKKtF83vfqpFdB9rrMfC2N04YXRsRPN9+ZmUkj7FqWmVdk5khmjixZsmRWzTiiUFKJ5vO+Vyc9CbLMHK1+PgJ8ETgFeHjylGH185Fq81FgWdPDj61qbeOIwv60afsop192Iydc9FVOv+zGtn5BXVJ9dD3IIuJFEfGSyevAG4E7gc3A+dVm5wNfqq5vBn6pGr14GvB40ynItnBEYf/pxmwrkuqhF6MWjwa+GBGTr391Zn4tIm4Bro2IdwH3A79QbX89cBawG3gCeGcnmnJEYX+ZabYV/ztJatb1IMvMe4HXTFH/LnDGFPUELuxCa+ojDsCR1Kp+Gn4vHeAAHEmtMsjUlxyAI6lVrkemvjT5OZgLkUo6HINMc9KN1a4dgCOpFQaZZs3VriX1Ez8j06y52rWkfmKQadYcGi+pn3hqsdKNz3zqYunQIKNThJZD46WD+b7SHR6R0fhlW3vd7QdNh7T2utudDmkaDo2XDs9p1rrHIAM+9OWdU65H9qEv7+xRR/3NuSmlw/Oz5O7x1CLwvSf2z6o+V3U6zeDQeGlm3f4s+bEfPtWR5y2BQdYlm7aPsvbzt7P/mcaR3+jYOGs/fztQ5pD1OoWy1Al+ltw9nlrskks27zwQYpP2P5Ncsrm9py+7sYaX5/7VSXVZh27tqhXPeYM9oqqrvQyyLhkbn/o05XT1uehWwHjuX51Spz+Stt3/GM8cUnumqqu9DLIa6VbA1Ol7ZHX5678u6vRH0mdvemBWdc2dn5HVSLcC5qWDA1MeSb50cKCtr9NpTrXVf+r0R5K6xyOyGpkuSNodMI3FvVuvz9XvbNrBieuu5/iLvsqJ667ndzbtaOvz1+mv/7pwHbr5ufrmhXm0Z5DVSLcCZmyaryVMV5+L39m0g8/e9AAT2RggM5HJZ296oK1h5l///WftqhUMHHHwL+zAEeEACc3IIKuRbn0fbnBg6l+b6epz8bmbH5xVfS5eME2/09XVHZN/vEx3WzqU/8dq1safPnQs1sz1uZjuzaudb2pPTtPvdHV13ge/uINDvqXCM9moS9NxsIdmbbosKe0P50PfMA9XV+f98KmJWdXnwy/114dBpllbFDHlkdGidn8Y12FHxNShdURZu6E5cMRqvXhqUbN27qnLZlWfi+lCsZ1h+fzFU//6T1dXfdR5xOpCHLnoEZlm7X+seTXQGHgxkcmiCM49ddmBejt04zOyf9s/9Wdh09XnylNY/ccRq/Xin56ak5EfO4offekLCOBHX/oCRn7sqLY+/5EvnPq7b9PV56Ib31mq05RLdeL31erFINOsTc7kf9BCpJ9v70Kk3RhQ0o0FQut8CqtkLg5bLwaZZq0bM/k/Ps1kytPV56IbC4R6Cqs/uThsvfgZmWatGzP5d2stp04vEOqaVP2rzovDNg/4OO/U43rYSXd4RKa+dPzLpn6jn67erzyFJXWeR2TqS/9w79RrNk1Xn6tOjyicfC5HLUqdY5CpL3VjsMem7aP8xl/exuRTjo6N8xt/eRvQ3i/F1vkUltQPDDItWL913e0cmotZ1Q2e+vP7ffVhkGnBempi6sO76eqqj8mvkEyOvp38Cgk4RVWJDDKpw/zLv//M9BWSuv23KW3KqrmMsjTIpA7atH2Utdfdzv6Jpr/8r/Mv/17rxldI1D0Ov5c66ENf3nkgxCbtn0g+9OX2fXlcWuiKCbKIWB0RuyJid0Rc1Ot+pFZ0a9VuaSErIsgiYhHwceBNwEnAuRFxUm+7kiT1gyKCDDgF2J2Z92bmU8A1wNk97kmS1AdKGewxDDzYdHsPcGrzBhFxAXABwHHH1X9uMZVhaHBgygEEQ4PtW45GC9eh73sLYV7FqZRyRHZYmXlFZo5k5siSJUt63Y4EwCVveRUDRxy8qvXAEcElb3lVjzpSnfi+11BKkI0Cy5puH1vV2mL5j7xoVvW5WByzq/fra0B3/r268RpHv+R5s6rPxZqVw6x/+2sOWi5k/dtf49D7Hjv9xKkXgp2urv4W2c7J6zokIhYD3wLOoBFgtwDnZeaUY5hHRkZy27Zts3qNMz/6N9zzyA8P3F7+Iy9i62++fq4tT+kV677K003/3IsDdn/454p7DejOv1c3XuPUS7fy8A+eOnD76Jc8j5s/eGZbX0Ozc/xFX31O7duXtf93+B2f+Ef+/l+enYT69BOP4qpf+am2vkYb9qXlP0Pn8r7X51re9yKCDCAizgL+CFgEbMjMS6fbtob/QSUtTAZZC0oZ7EFmXg9c3+s+JEn9pZTPyCRJmpJBJkkqmkEmSSqaQSZJKppBJkkqmkEmSSqaQSZJKppBJkkqWjEze8xGROwD7u91HzN4OfBor5tok7rsS132A+qzL3XZD5j7vjyamatb2TAivtbqtnVTyyDrdxGxLTNHet1HO9RlX+qyH1CffanLfkC99qUfeWpRklQ0g0ySVDSDrDeu6HUDbVSXfanLfkB99qUu+wH12pe+42dkkqSieUQmSSqaQSZJKppB1iURsSwivhERd0XEzoh4b697mq+IWBQR2yPiK73uZT4iYigirouIf46IuyOivevdd0lE/Eb1u3VnRHwuIl7Q655aFREbIuKRiLizqXZURGyNiHuqn0f2ssdWTbMv66vfrzsi4osRMdTLHuvGIOuep4H3Z+ZJwGnAhRFxUo97mq/3Anf3uok2+GPga5n5SuA1FLhPETEMvAcYycyfABYB5/S2q1n5NHDol3kvAm7IzOXADdXtEnya5+7LVuAnMvMngW8B67rdVJ0ZZF2SmQ9l5q3V9R/QeLMc7m1XcxcRxwI/B/x5r3uZj4h4KfDTwCcBMvOpzBzrbVdzthgYjIjFwAuBvT3up2WZ+bfAY4eUzwaurK5fCazpalNzNNW+ZObXM/Pp6uZNwLFdb6zGDLIeiIjjgZXAzb3tZF7+CPgt4JleNzJPJwD7gE9Vp0n/PCJe1OumZiszR4GPAA8ADwGPZ+bXe9vVvB2dmQ9V178DHN3LZtrol4G/6nUTdWKQdVlEvBj4AvC+zPx+r/uZi4h4M/BIZv5Tr3tpg8XAa4HLM3Ml8EPKOYV1QPX50dk0gnkp8KKI+K+97ap9svE9oeK/KxQRH6TxMcNVve6lTgyyLoqIARohdlVmbux1P/NwOvCWiPg2cA3whoj4bG9bmrM9wJ7MnDw6vo5GsJXmZ4H7MnNfZu4HNgL/scc9zdfDEXEMQPXzkR73My8R8d+ANwPvSL/A21YGWZdERND4HObuzPxor/uZj8xcl5nHZubxNAYU3JiZRf71n5nfAR6MiBVV6Qzgrh62NFcPAKdFxAur37UzKHDQyiE2A+dX188HvtTDXuYlIlbTOBX/lsx8otf91I1B1j2nA79I4+jltupyVq+bEgC/DlwVEXcAJwP/s8f9zFp1RHkdcCuwg8b/28VMixQRnwP+EVgREXsi4l3AZcCZEXEPjSPOy3rZY6um2Zf/BbwE2Fr9v/+/e9pkzThFlSSpaB6RSZKKZpBJkopmkEmSimaQSZKKZpBJkopmkEltVM2k/2tNt19f+uoAUr8zyKT2GgJ+7bBbSWobg0wLVkQcX60R9emI+FZEXBURPxsRf1+tgXVKtSbWpmodqZsi4ierx15SrTv1NxFxb0S8p3ray4ATqy+mQNBmAAABIklEQVS9rq9qL25a7+yqauYNSW2yuNcNSD32CuDtNGYkvwU4D3gd8BbgA8CDwPbMXBMRbwA+Q2P2D4BXAj9DY8aGXRFxOY0Jh38iM0+GxqlFGisdvIrGsip/T2OWl7/rxs5JC4FHZFro7svMHZn5DLCTxkKOSWOap+NphNpfAGTmjcDLIuLfVY/9amY+mZmP0pjQdrplRr6ZmXuq17itel5JbWKQaaF7sun6M023n+HwZyyaHzsxw/atbidpDgwyaWb/D3gHHDhN+Ohh1pH7AY1TjZK6xL8MpZldAmyoZsZ/gmeXFZlSZn63GixyJ41VgL/a+Ralhc3Z7yVJRfPUoiSpaAaZJKloBpkkqWgGmSSpaAaZJKloBpkkqWgGmSSpaP8f9Q+XMQRrspEAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x432 with 3 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.figure(figsize=(10,6))\n",
    "sns.jointplot(x='month',y='Vehicles',data=df_transport)\n",
    "#plt.title('Vehicles by Month')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Quality Issue #3:  \n",
    "##### Rename a Feature Column and Remove a Value.  \n",
    "\n",
    "Our feature columns have different \"capitalizations\" in their names, e.g. both upper and lower \"case\".  In addition, there are \"spaces\" in some of the column names.  In addition, we are only interested in years greater than 2006, not \"<2006\".  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #3a:** Remove all the spaces for feature columns by renaming them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>date</th>\n",
        "      <th>zipcode</th>\n",
        "      <th>modelyear</th>\n",
        "      <th>fuel</th>\n",
        "      <th>make</th>\n",
        "      <th>lightduty</th>\n",
        "      <th>vehicles</th>\n",
        "      <th>year</th>\n",
        "      <th>month</th>\n",
        "      <th>day</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>Yes</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2014</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>Yes</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
        "        date  zipcode modelyear      fuel       make lightduty  vehicles  \\\n",
        "0 2018-10-01  90000.0      2006  Gasoline  OTHER/UNK       Yes       1.0   \n",
        "1 2018-10-01  90001.0      2014  Gasoline  OTHER/UNK       Yes       1.0   \n",
        "\n",
        "   year  month  day  \n",
        "0  2018     10    1  \n",
        "1  2018     10    1  "
       ]
      },
      "execution_count": 26,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# TODO 3a\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " **Note:** Next we create a copy of the dataframe to avoid the \"SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame\" warning.  Run the cell to remove the value '<2006' from the modelyear feature column. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #3b:** Create a copy of the dataframe to avoid copy warning issues."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO 3b\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, confirm that the modelyear value '<2006' has been removed by doing a value count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/plain": [
        "2007    53\n",
        "2008    45\n",
        "2006    36\n",
        "2010    34\n",
        "2014    31\n",
        "2015    30\n",
        "2017    29\n",
        "2016    29\n",
        "2013    27\n",
        "2009    25\n",
        "2012    25\n",
        "2011    24\n",
        "2018    23\n",
        "2019     5\n",
        "Name: modelyear, dtype: int64"
       ]
      },
      "execution_count": 28,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "df['modelyear'].value_counts(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Quality Issue #4:  \n",
    "##### Handling Categorical Columns\n",
    "\n",
    "The feature column \"lightduty\" is categorical and has a \"Yes/No\" choice.  We cannot feed values like this into a machine learning model.  We need to convert the binary answers from strings of yes/no to integers of 1/0.  There are various methods to achieve this.  We will use the \"apply\" method with a lambda expression.  Pandas. apply() takes a function and applies it to all values of a Pandas series.\n",
    "\n",
    "##### What is a Lambda Function?\n",
    "\n",
    "Typically, Python requires that you define a function using the def keyword. However, lambda functions are anonymous -- which means there is no need to name them. The most common use case for lambda functions is in code that requires a simple one-line function (e.g. lambdas only have a single expression).  \n",
    "\n",
    "As you progress through the Course Specialization, you will see many examples where lambda functions are being used.  Now is a good time to become familiar with them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, lets count the number of \"Yes\" and\"No's\" in the 'lightduty' feature column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/plain": [
        "Yes    374\n",
        "No      42\n",
        "Name: lightduty, dtype: int64"
       ]
      },
      "execution_count": 29,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "df['lightduty'].value_counts(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's convert the Yes to 1 and No to 0. Pandas. apply()  . apply takes a function and applies it to all values of a Pandas series (e.g. lightduty). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/plain": [
        "1    374\n",
        "0     42\n",
        "Name: lightduty, dtype: int64"
       ]
      },
      "execution_count": 30,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "df.loc[:,'lightduty'] = df['lightduty'].apply(lambda x: 0 if x=='No' else 1)\n",
    "df['lightduty'].value_counts(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>date</th>\n",
        "      <th>zipcode</th>\n",
        "      <th>modelyear</th>\n",
        "      <th>fuel</th>\n",
        "      <th>make</th>\n",
        "      <th>lightduty</th>\n",
        "      <th>vehicles</th>\n",
        "      <th>year</th>\n",
        "      <th>month</th>\n",
        "      <th>day</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2014</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>2017</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>16</th>\n",
        "      <td>2018-10-09</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>Diesel and Diesel Hybrid</td>\n",
        "      <td>Type_C</td>\n",
        "      <td>0</td>\n",
        "      <td>16.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>9</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>17</th>\n",
        "      <td>2018-10-10</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>Diesel and Diesel Hybrid</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>0</td>\n",
        "      <td>23.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>10</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
        "         date  zipcode modelyear                      fuel       make  \\\n",
        "0  2018-10-01  90000.0      2006                  Gasoline  OTHER/UNK   \n",
        "1  2018-10-01  90001.0      2014                  Gasoline  OTHER/UNK   \n",
        "3  2018-10-01  90000.0      2017                  Gasoline  OTHER/UNK   \n",
        "16 2018-10-09  90001.0      2006  Diesel and Diesel Hybrid     Type_C   \n",
        "17 2018-10-10  90001.0      2006  Diesel and Diesel Hybrid  OTHER/UNK   \n",
        "\n",
        "    lightduty  vehicles  year  month  day  \n",
        "0           1       1.0  2018     10    1  \n",
        "1           1       1.0  2018     10    1  \n",
        "3           1       1.0  2018     10    1  \n",
        "16          0      16.0  2018     10    9  \n",
        "17          0      23.0  2018     10   10  "
       ]
      },
      "execution_count": 31,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# Confirm that \"lightduty\" has been converted.\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### One-Hot Encoding Categorical Feature Columns\n",
    "\n",
    "Machine learning algorithms expect input vectors and not categorical features. Specifically, they cannot handle text or string values.  Thus, it is often useful to transform categorical features into vectors.\n",
    "\n",
    "One transformation method is to create dummy variables for our categorical features.  Dummy variables are a set of binary (0 or 1) variables that each represent a single class from a categorical feature.  We simply  encode the categorical variable as a one-hot vector, i.e. a vector where only one element is non-zero, or hot.  With one-hot encoding, a categorical feature becomes an array whose size is the number of possible choices for that feature."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Panda provides a function called \"get_dummies\" to convert a categorical variable into dummy/indicator variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>zipcode</th>\n",
        "      <th>modelyear_2007</th>\n",
        "      <th>modelyear_2008</th>\n",
        "      <th>modelyear_2009</th>\n",
        "      <th>modelyear_2010</th>\n",
        "      <th>modelyear_2011</th>\n",
        "      <th>modelyear_2012</th>\n",
        "      <th>modelyear_2013</th>\n",
        "      <th>modelyear_2014</th>\n",
        "      <th>modelyear_2015</th>\n",
        "      <th>...</th>\n",
        "      <th>make_Type_P</th>\n",
        "      <th>make_Type_Q</th>\n",
        "      <th>make_Type_R</th>\n",
        "      <th>make_Type_S</th>\n",
        "      <th>make_Type_T</th>\n",
        "      <th>make_Type_U</th>\n",
        "      <th>make_Type_V</th>\n",
        "      <th>make_Type_W</th>\n",
        "      <th>make_Type_X</th>\n",
        "      <th>make_Type_Y</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>90000.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>90001.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>1</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>90000.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>16</th>\n",
        "      <td>90001.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>17</th>\n",
        "      <td>90001.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows × 49 columns</p>\n",
        "</div>"
       ],
       "text/plain": [
        "    zipcode  modelyear_2007  modelyear_2008  modelyear_2009  modelyear_2010  \\\n",
        "0   90000.0               0               0               0               0   \n",
        "1   90001.0               0               0               0               0   \n",
        "3   90000.0               0               0               0               0   \n",
        "16  90001.0               0               0               0               0   \n",
        "17  90001.0               0               0               0               0   \n",
        "\n",
        "    modelyear_2011  modelyear_2012  modelyear_2013  modelyear_2014  \\\n",
        "0                0               0               0               0   \n",
        "1                0               0               0               1   \n",
        "3                0               0               0               0   \n",
        "16               0               0               0               0   \n",
        "17               0               0               0               0   \n",
        "\n",
        "    modelyear_2015  ...  make_Type_P  make_Type_Q  make_Type_R  make_Type_S  \\\n",
        "0                0  ...            0            0            0            0   \n",
        "1                0  ...            0            0            0            0   \n",
        "3                0  ...            0            0            0            0   \n",
        "16               0  ...            0            0            0            0   \n",
        "17               0  ...            0            0            0            0   \n",
        "\n",
        "    make_Type_T  make_Type_U  make_Type_V  make_Type_W  make_Type_X  \\\n",
        "0             0            0            0            0            0   \n",
        "1             0            0            0            0            0   \n",
        "3             0            0            0            0            0   \n",
        "16            0            0            0            0            0   \n",
        "17            0            0            0            0            0   \n",
        "\n",
        "    make_Type_Y  \n",
        "0             0  \n",
        "1             0  \n",
        "3             0  \n",
        "16            0  \n",
        "17            0  \n",
        "\n",
        "[5 rows x 49 columns]"
       ]
      },
      "execution_count": 32,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# Making dummy variables for categorical data with more inputs.  \n",
    "\n",
    "data_dummy = pd.get_dummies(df[['zipcode','modelyear', 'fuel', 'make']], drop_first=True)\n",
    "data_dummy.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #4a:** Merge (concatenate) original data frame with 'dummy' dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>date</th>\n",
        "      <th>zipcode</th>\n",
        "      <th>modelyear</th>\n",
        "      <th>fuel</th>\n",
        "      <th>make</th>\n",
        "      <th>lightduty</th>\n",
        "      <th>vehicles</th>\n",
        "      <th>year</th>\n",
        "      <th>month</th>\n",
        "      <th>day</th>\n",
        "      <th>...</th>\n",
        "      <th>make_Type_P</th>\n",
        "      <th>make_Type_Q</th>\n",
        "      <th>make_Type_R</th>\n",
        "      <th>make_Type_S</th>\n",
        "      <th>make_Type_T</th>\n",
        "      <th>make_Type_U</th>\n",
        "      <th>make_Type_V</th>\n",
        "      <th>make_Type_W</th>\n",
        "      <th>make_Type_X</th>\n",
        "      <th>make_Type_Y</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2014</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>2018-10-01</td>\n",
        "      <td>90000.0</td>\n",
        "      <td>2017</td>\n",
        "      <td>Gasoline</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>16</th>\n",
        "      <td>2018-10-09</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>Diesel and Diesel Hybrid</td>\n",
        "      <td>Type_C</td>\n",
        "      <td>0</td>\n",
        "      <td>16.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>9</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>17</th>\n",
        "      <td>2018-10-10</td>\n",
        "      <td>90001.0</td>\n",
        "      <td>2006</td>\n",
        "      <td>Diesel and Diesel Hybrid</td>\n",
        "      <td>OTHER/UNK</td>\n",
        "      <td>0</td>\n",
        "      <td>23.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>10</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows × 59 columns</p>\n",
        "</div>"
       ],
       "text/plain": [
        "         date  zipcode modelyear                      fuel       make  \\\n",
        "0  2018-10-01  90000.0      2006                  Gasoline  OTHER/UNK   \n",
        "1  2018-10-01  90001.0      2014                  Gasoline  OTHER/UNK   \n",
        "3  2018-10-01  90000.0      2017                  Gasoline  OTHER/UNK   \n",
        "16 2018-10-09  90001.0      2006  Diesel and Diesel Hybrid     Type_C   \n",
        "17 2018-10-10  90001.0      2006  Diesel and Diesel Hybrid  OTHER/UNK   \n",
        "\n",
        "    lightduty  vehicles  year  month  day  ...  make_Type_P  make_Type_Q  \\\n",
        "0           1       1.0  2018     10    1  ...            0            0   \n",
        "1           1       1.0  2018     10    1  ...            0            0   \n",
        "3           1       1.0  2018     10    1  ...            0            0   \n",
        "16          0      16.0  2018     10    9  ...            0            0   \n",
        "17          0      23.0  2018     10   10  ...            0            0   \n",
        "\n",
        "    make_Type_R  make_Type_S  make_Type_T  make_Type_U  make_Type_V  \\\n",
        "0             0            0            0            0            0   \n",
        "1             0            0            0            0            0   \n",
        "3             0            0            0            0            0   \n",
        "16            0            0            0            0            0   \n",
        "17            0            0            0            0            0   \n",
        "\n",
        "    make_Type_W  make_Type_X  make_Type_Y  \n",
        "0             0            0            0  \n",
        "1             0            0            0  \n",
        "3             0            0            0  \n",
        "16            0            0            0  \n",
        "17            0            0            0  \n",
        "\n",
        "[5 rows x 59 columns]"
       ]
      },
      "execution_count": 33,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# TODO 4a\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #4b:** Drop attributes for which we made dummy variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO 4b\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>lightduty</th>\n",
        "      <th>vehicles</th>\n",
        "      <th>year</th>\n",
        "      <th>month</th>\n",
        "      <th>day</th>\n",
        "      <th>modelyear_2007</th>\n",
        "      <th>modelyear_2008</th>\n",
        "      <th>modelyear_2009</th>\n",
        "      <th>modelyear_2010</th>\n",
        "      <th>modelyear_2011</th>\n",
        "      <th>...</th>\n",
        "      <th>make_Type_P</th>\n",
        "      <th>make_Type_Q</th>\n",
        "      <th>make_Type_R</th>\n",
        "      <th>make_Type_S</th>\n",
        "      <th>make_Type_T</th>\n",
        "      <th>make_Type_U</th>\n",
        "      <th>make_Type_V</th>\n",
        "      <th>make_Type_W</th>\n",
        "      <th>make_Type_X</th>\n",
        "      <th>make_Type_Y</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>1</td>\n",
        "      <td>1.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>1</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>16</th>\n",
        "      <td>0</td>\n",
        "      <td>16.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>9</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>17</th>\n",
        "      <td>0</td>\n",
        "      <td>23.0</td>\n",
        "      <td>2018</td>\n",
        "      <td>10</td>\n",
        "      <td>10</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows × 53 columns</p>\n",
        "</div>"
       ],
       "text/plain": [
        "    lightduty  vehicles  year  month  day  modelyear_2007  modelyear_2008  \\\n",
        "0           1       1.0  2018     10    1               0               0   \n",
        "1           1       1.0  2018     10    1               0               0   \n",
        "3           1       1.0  2018     10    1               0               0   \n",
        "16          0      16.0  2018     10    9               0               0   \n",
        "17          0      23.0  2018     10   10               0               0   \n",
        "\n",
        "    modelyear_2009  modelyear_2010  modelyear_2011  ...  make_Type_P  \\\n",
        "0                0               0               0  ...            0   \n",
        "1                0               0               0  ...            0   \n",
        "3                0               0               0  ...            0   \n",
        "16               0               0               0  ...            0   \n",
        "17               0               0               0  ...            0   \n",
        "\n",
        "    make_Type_Q  make_Type_R  make_Type_S  make_Type_T  make_Type_U  \\\n",
        "0             0            0            0            0            0   \n",
        "1             0            0            0            0            0   \n",
        "3             0            0            0            0            0   \n",
        "16            0            0            0            0            0   \n",
        "17            0            0            0            0            0   \n",
        "\n",
        "    make_Type_V  make_Type_W  make_Type_X  make_Type_Y  \n",
        "0             0            0            0            0  \n",
        "1             0            0            0            0  \n",
        "3             0            0            0            0  \n",
        "16            0            0            0            0  \n",
        "17            0            0            0            0  \n",
        "\n",
        "[5 rows x 53 columns]"
       ]
      },
      "execution_count": 35,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# Confirm that 'zipcode','modelyear', 'fuel', and 'make' have been dropped.\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Quality Issue #5:  \n",
    "##### Temporal Feature Columns\n",
    "\n",
    "Our dataset now contains year, month, and day feature columns.  Let's convert the month and day feature columns to meaningful representations as a way to get us thinking about changing temporal features -- as they are sometimes overlooked.  \n",
    "\n",
    "Note that the Feature Engineering course in this Specialization will provide more depth on methods to handle year, month, day, and hour feature columns.\n",
    "\n",
    "First, let's print the unique values for \"month\" and \"day\" in our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "Unique values of month: [10 11 12  1  2  3]\n",
       "Unique values of day: [ 1  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31\n",
       "  2  3  4  5  6  7  8]\n",
       "Unique values of year: [2018 2019]\n"
      ]
     }
   ],
   "source": [
    "print ('Unique values of month:',df.month.unique())\n",
    "print ('Unique values of day:',df.day.unique())\n",
    "print ('Unique values of year:',df.year.unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we map each temporal variable onto a circle such that the lowest value for that variable appears right next to the largest value. We compute the x- and y- component of that point using sin and cos trigonometric functions.   Don't worry, this is the last time we will use this code, as you can develop an input pipeline to address these temporal feature columns in TensorFlow and Keras - and it is much easier!  But, sometimes you need to appreciate what you're not going to encounter as you move through the course!\n",
    "\n",
    "Run the cell to view the output."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Lab Task #5:** Drop month, and day"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['day_sin'] = np.sin(df.day*(2.*np.pi/31))\n",
    "df['day_cos'] = np.cos(df.day*(2.*np.pi/31))\n",
    "df['month_sin'] = np.sin((df.month-1)*(2.*np.pi/12))\n",
    "df['month_cos'] = np.cos((df.month-1)*(2.*np.pi/12))\n",
    "\n",
    "# TODO 5\n",
    "# TODO -- Your code here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/html": [
        "<div>\n",
        "<style scoped>\n",
        "    .dataframe tbody tr th:only-of-type {\n",
        "        vertical-align: middle;\n",
        "    }\n",
        "\n",
        "    .dataframe tbody tr th {\n",
        "        vertical-align: top;\n",
        "    }\n",
        "\n",
        "    .dataframe thead th {\n",
        "        text-align: right;\n",
        "    }\n",
        "</style>\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>lightduty</th>\n",
        "      <th>vehicles</th>\n",
        "      <th>modelyear_2007</th>\n",
        "      <th>modelyear_2008</th>\n",
        "      <th>modelyear_2009</th>\n",
        "      <th>modelyear_2010</th>\n",
        "      <th>modelyear_2011</th>\n",
        "      <th>modelyear_2012</th>\n",
        "      <th>modelyear_2013</th>\n",
        "      <th>modelyear_2014</th>\n",
        "      <th>...</th>\n",
        "      <th>make_Type_T</th>\n",
        "      <th>make_Type_U</th>\n",
        "      <th>make_Type_V</th>\n",
        "      <th>make_Type_W</th>\n",
        "      <th>make_Type_X</th>\n",
        "      <th>make_Type_Y</th>\n",
        "      <th>day_sin</th>\n",
        "      <th>day_cos</th>\n",
        "      <th>month_sin</th>\n",
        "      <th>month_cos</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>495</th>\n",
        "      <td>1</td>\n",
        "      <td>58.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>1</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0.724793</td>\n",
        "      <td>0.688967</td>\n",
        "      <td>-0.5</td>\n",
        "      <td>0.866025</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>496</th>\n",
        "      <td>1</td>\n",
        "      <td>45.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>1</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0.848644</td>\n",
        "      <td>0.528964</td>\n",
        "      <td>-0.5</td>\n",
        "      <td>0.866025</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>497</th>\n",
        "      <td>1</td>\n",
        "      <td>82.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>1</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0.937752</td>\n",
        "      <td>0.347305</td>\n",
        "      <td>-0.5</td>\n",
        "      <td>0.866025</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>498</th>\n",
        "      <td>1</td>\n",
        "      <td>12.0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>1</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>...</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0</td>\n",
        "      <td>0.988468</td>\n",
        "      <td>0.151428</td>\n",
        "      <td>-0.5</td>\n",
        "      <td>0.866025</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>4 rows × 54 columns</p>\n",
        "</div>"
       ],
       "text/plain": [
        "     lightduty  vehicles  modelyear_2007  modelyear_2008  modelyear_2009  \\\n",
        "495          1      58.0               0               0               0   \n",
        "496          1      45.0               0               0               0   \n",
        "497          1      82.0               0               0               0   \n",
        "498          1      12.0               0               0               0   \n",
        "\n",
        "     modelyear_2010  modelyear_2011  modelyear_2012  modelyear_2013  \\\n",
        "495               1               0               0               0   \n",
        "496               1               0               0               0   \n",
        "497               1               0               0               0   \n",
        "498               1               0               0               0   \n",
        "\n",
        "     modelyear_2014  ...  make_Type_T  make_Type_U  make_Type_V  make_Type_W  \\\n",
        "495               0  ...            0            0            0            0   \n",
        "496               0  ...            0            0            0            0   \n",
        "497               0  ...            0            0            0            0   \n",
        "498               0  ...            0            0            0            0   \n",
        "\n",
        "     make_Type_X  make_Type_Y   day_sin   day_cos  month_sin  month_cos  \n",
        "495            0            0  0.724793  0.688967       -0.5   0.866025  \n",
        "496            0            0  0.848644  0.528964       -0.5   0.866025  \n",
        "497            0            0  0.937752  0.347305       -0.5   0.866025  \n",
        "498            0            0  0.988468  0.151428       -0.5   0.866025  \n",
        "\n",
        "[4 rows x 54 columns]"
       ]
      },
      "execution_count": 38,
      "metadata": {},
      "output_type": "execute_result"
     }
   ],
   "source": [
    "# scroll left to see the converted month and day coluumns.\n",
    "df.tail(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "\n",
    "This notebook introduced a few concepts to improve data quality.  We resolved missing values, converted the Date feature column to a datetime format, renamed feature columns, removed a value from a feature column, created one-hot encoding features, and converted temporal features to meaningful representations.  By the end of our lab, we gained an understanding as to why data should be \"cleaned\" and \"pre-processed\" before input into a machine learning model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2021 Google Inc.\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "http://www.apache.org/licenses/LICENSE-2.0\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Basic Feature Engineering in Keras.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
